# Human-Human Commensality Dataset

## Overview
A novel audio-visual dataset capturing human social eating behaviors of groups of three people sharing a meal. It contains multi-view RGBD video and directional audio recordings of 30 sessions, totaling over 18 hours of multistream, multimodal recordings of 90 people, and provides the following data:
- **ROS bags with topics:** 4x mic audio, mixed audio, sound direction, per-participant RGBD, and scene RGBD.
- **Raw data (extracted from ROS bags):** scene audio, sound direction, per-participant videos, and scene videos.
- **Processed data (extracted from raw data):** per-participant speaking status, per-participant face and body keypoints from OpenPose, per-participant gaze and head pose from RT-GENE, per-participant bite count, and per-participant times since the last bite was lifted and since the last bite was delivered to the mouth.
- **Annotations:** per-participant interactions with food, drink, and napkins (all entered, lifted, delivered to mouth, and mouth open events), per-participant food type labels, and observations of interesting behaviors.

Please see Section 5 (and the referenced Appendix sections) of [our paper](https://arxiv.org/abs/2207.03348) [1] for the data collection study setup, data annotation details, pre- and post-study questionnaires, and data statistics.


## Naming conventions
The 30 recorded sessions are denoted by session IDs from 01 to 30. The session ID 00 corresponds to the pilot study. 

We organize the data extracted from the ROS bag files into a `raw` folder (e.g., audio and video files) and further processed/extracted data into a `processed` folder (e.g., audio features).

## Documentation pictures
Pictures capturing the study setup details can be found in the folder `documentation-pictures`. 

## ROS bag files and associated extraction scripts
Due to size limitations, we provide the ROS bag files externally at https://cornell.box.com/v/human-human-commensality-bag<br>
For each session there is one ROS bag file with the filename `{session_ID}.bag` and it contains the following ROS topics:
 - /audio - mixed audio
 - /audio/channel0 - mixed audio
 - /audio/channel1 - mic 1 audio
 - /audio/channel2 - mic 2 audio
 - /audio/channel3 - mic 3 audio
 - /audio/channel4 - mic 4 audio
 - /camera0/aligned_depth_to_color/image_raw/compressed  
 - /camera0/color/image_raw/compressed     
 - /camera1/aligned_depth_to_color/image_raw/compressed  
 - /camera1/color/image_raw/compressed                   
 - /camera2/aligned_depth_to_color/image_raw/compressed  
 - /camera2/color/image_raw/compressed                   
 - /camera3/aligned_depth_to_color/image_raw/compressed  
 - /camera3/color/image_raw/compressed                   
 - /sound_direction - in degrees (see the documentation pictures)

`camera0` denotes the scene camera. 
The microphone and camera positions are shown in the documentation pictures. 

Unfortunately, the audio ROS topics are missing for session ID 09.

The scripts to extract video and audio data based on the specified ROS topics can be found in the folder `bag-extraction-scripts`.
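
For readers who prefer to read a bag directly instead of using the provided scripts, the following is a minimal sketch rather than the extraction code shipped with the dataset. It assumes a ROS 1 environment where the `rosbag` Python package is installed. The helper names `camera_topics` and `iter_messages` are hypothetical; only the topic names come from the list above.

```python
from typing import Iterator, List, Tuple

# Audio topics as listed above: the mixed stream plus channels 0-4.
AUDIO_TOPICS = ["/audio"] + [f"/audio/channel{i}" for i in range(5)]


def camera_topics(position: int) -> List[str]:
    """Return the color and aligned-depth topics for a camera (0 = scene, 1-3 = participants)."""
    if position not in (0, 1, 2, 3):
        raise ValueError("camera position must be 0 (scene) or 1-3 (participants)")
    base = f"/camera{position}"
    return [
        f"{base}/color/image_raw/compressed",
        f"{base}/aligned_depth_to_color/image_raw/compressed",
    ]


def iter_messages(bag_path: str, topics: List[str]) -> Iterator[Tuple[str, object, float]]:
    """Yield (topic, message, time in seconds) for the selected topics of one bag file.

    rosbag is imported lazily so that camera_topics() is usable without ROS installed.
    """
    import rosbag  # ROS 1 Python package (assumed to be available)

    with rosbag.Bag(bag_path, "r") as bag:
        for topic, msg, stamp in bag.read_messages(topics=topics):
            yield topic, msg, stamp.to_sec()
```

For example, `iter_messages("01.bag", camera_topics(1))` would iterate over the RGB and depth messages of participant 1 in session 01, assuming the bag file is in the working directory.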

## Video files
The video files (.mp4) are located in the folder `raw/video`. They were extracted using the script `bag-extraction-scripts/extract_all_bags_video.sh` and by selecting the ROS topics `/camera{participant_position}/color/image_raw/compressed`, where participant position 0 denotes the video of the whole scene and participant positions 1, 2, and 3 correspond to the individual participants' videos. 

For each session there are four video files, each corresponding to one camera. The video filenames are of the form `{session_ID}_{participant_position}.mp4`. For the pilot study (session ID 00) there was no scene camera.

## Audio files
The mixed audio files (.wav) are located in the folder `raw/audio`. They were extracted using the script `bag-extraction-scripts/extract_all_bags_audio.py` and by selecting the ROS topic `/audio`. For each session there is one file with the filename `{session_ID}.wav`.

The audio sound direction data (.csv) can be found in the folder `raw/sound-direction`. It was extracted using the script `bag-extraction-scripts/extract_all_bags_sound_direction.sh` and by selecting the ROS topic `/sound_direction`. For each session there is one file with the filename `{session_ID}.csv`.

Unfortunately, all the audio data is missing for session ID 09.
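
The naming rules for the raw video, audio, and sound-direction files can be collected in one small helper. This is only a sketch: the function name `raw_files` and the `root` argument (the directory into which the dataset was downloaded) are hypothetical, and entries that this README describes as missing are returned as `None`.

```python
from pathlib import Path
from typing import Dict, Optional

AUDIO_MISSING = {"09"}    # session without audio and sound-direction data
NO_SCENE_CAMERA = {"00"}  # pilot study, recorded without a scene camera


def raw_files(root: str, session: int) -> Dict[str, Optional[Path]]:
    """Map each raw stream of a session to its expected path, or None if it is missing."""
    sid = f"{session:02d}"
    raw = Path(root) / "raw"
    files: Dict[str, Optional[Path]] = {}
    for pos in range(4):
        key = "video_scene" if pos == 0 else f"video_p{pos}"
        missing = pos == 0 and sid in NO_SCENE_CAMERA
        files[key] = None if missing else raw / "video" / f"{sid}_{pos}.mp4"
    has_audio = sid not in AUDIO_MISSING
    files["audio"] = raw / "audio" / f"{sid}.wav" if has_audio else None
    files["sound_direction"] = raw / "sound-direction" / f"{sid}.csv" if has_audio else None
    return files
```

The function only builds paths and does not check that the files exist on disk.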

## Annotation files
The annotation instructions and conventions we defined can be found in the Appendix of our paper and, with further details, in `annotation/Annotation_Instructions.docx`.

The annotations of participants' food types and observations of interesting behaviors are recorded in `annotation/Annotation_FoodTypes_And_Observations.xlsx` and replicated in `annotation/Annotation_FoodTypes_And_Observations.tsv` for easier data processing.

The ELAN annotation files (.eaf) for each video (except for the scene camera videos) are located in the folder `annotation/annotation-files`.
The filenames are of the form `{session_ID}_{participant_position}.eaf`.
The subfolder `annotation/annotation-templates` contains the ELAN template file specifying the tier names and tier types we created.
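
ELAN `.eaf` files are XML documents, so they can be read without ELAN itself. The sketch below uses only the Python standard library and relies on the standard EAF layout (`TIME_SLOT`, `TIER`, and `ALIGNABLE_ANNOTATION` elements). The function name `parse_eaf` is hypothetical, and the actual tier names should be taken from the template file rather than assumed.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple, Union

Interval = Tuple[int, int, str]  # (start in ms, end in ms, annotation value)


def parse_eaf(xml_data: Union[str, bytes]) -> Dict[str, List[Interval]]:
    """Return, for every tier in an EAF document, its time-aligned annotations."""
    root = ET.fromstring(xml_data)
    # Time slots map an ID to a time in milliseconds; unaligned slots have no value.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    tiers: Dict[str, List[Interval]] = {}
    for tier in root.iter("TIER"):
        intervals: List[Interval] = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without resolved times
            value = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            intervals.append((start, end, value))
        tiers[tier.get("TIER_ID")] = sorted(intervals)
    return tiers
```

A file can be passed in as bytes, for example `parse_eaf(open("01_1.eaf", "rb").read())`, assuming the file is in the working directory.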

## Questionnaires
The data from the pre-study and post-study questionnaires is located in the folder `questionnaires`.
The replies to the open-ended post-study questions can also be found in `questionnaires/dataset_collection_openended_quetions_replies.txt` in an easy-to-read format.

## Features (processed data)
Features extracted from the `raw` data are located in the following folders:
- `processed/audio-features` 
  - `aligned-audio`<br>
    Each .npy file is of size (T, 1024), where T is the number of video frames in a session. Because Respeaker ROS streams audio at roughly 15 frames per second while video is streamed at roughly 30 frames per second, we upsample the audio by assigning to each video frame the audio frame with the nearest timestamp. As a result, a single audio frame can be assigned to 1-3 video frames, which emulates how data is received in real time on the robot. All features in this folder are aligned and upsampled using this procedure. Each audio frame is 64 ms long.
  - `sound-direction`<br>
    Each .npy file states which direction sound is coming from in the Respeaker microphone array. For each session, we take the ROS messages received under the "/sound_direction" ROS topic, and align them to their respective video frames based on timestamps. 
  - `binary-speaking`<br>
    Each .npy file indicates, for every frame of the corresponding video, whether a person is speaking. This is computed using a Python interface to the voice activity detector from Google's WebRTC project (https://github.com/wiseman/py-webrtcvad). Each file contains a list with one value per video frame: 1 if someone is talking in that frame and 0 otherwise.
  - `person-speaking`<br>
    These labels are similar to the binary-speaking labels, but they also identify who is speaking. We cluster the sound-direction angles using KMeans, sort the clusters, and label them 1, 2, and 3 to designate participants 1, 2, and 3 as speaking. A label of 0 means that no one is talking. In other words, these labels combine the clustered speaker directions with the binary-speaking labels. 
  - `mfcc`<br>
    Each .npy file consists of a (T x 5 x 13) MFCC array computed from every frame of aligned-audio. Since each audio frame is 64 ms long, a sliding window of size 25 ms with a 10 ms stride leads to 5 windows. At each window, the 13 highest energy cepstral coefficients are provided.
  - `logmel`<br>
    Each .npy file consists of a (T x 5 x 128) log-mel spectrogram. Similar to the MFCC, we have 5 sliding windows where 128 Mel bands are computed.
  - `logfb`<br>
    Each .npy file consists of (T x 5 x 26) log filter bank (logfb) features. For each aligned-audio sample, we compute 26 log filter bank coefficients for 5 windows over the 64 ms aligned-audio sample.
- `processed/count-features`<br>
  For each participant, there is a .csv file that contains the number of food_lifted and food_to_mouth annotations that have appeared on or before a given timestamp.
- `processed/time-features`<br>
  For each participant, there is a .csv file that contains the time since the last food_lifted and food_to_mouth annotations were made at a given timestamp.
  The "zeroed within" variant denotes that the time feature was set to zero within the annotation. Otherwise, the time feature started increasing immediately from the beginning of the annotation.
- `processed/visual-features`<br>
  - `rt-gene`<br>
    Each .csv file contains gaze and head pose estimates from RT-GENE (https://github.com/Tobias-Fischer/rt_gene). It has three columns:
    - name: The frame number of the video which the gaze corresponds to. Note that not all frames have gaze/head pose estimations. In these instances, the user may want to interpolate or approximate values.
    - gaze: A two-number vector representing the gaze.
    - headpose: A two-number vector representing the head pose.
- OpenPose visual features (face and body keypoints) are hosted externally at https://drive.google.com/drive/folders/1gJL4gJ1IxBcuQuns5Fy6Hmhe7IcrvPdF?usp=share_link
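
The nearest-timestamp upsampling described for `aligned-audio` and the window count quoted for `mfcc`, `logmel`, and `logfb` can be illustrated with a short sketch. It is not the code used to produce the features. The function names are hypothetical, and the window arithmetic rests on two assumptions not stated in this README: the 1024 values per frame are raw samples (which implies 16 kHz for a 64 ms frame), and the last partial window is zero-padded.

```python
import math
from bisect import bisect_left
from typing import List, Sequence


def nearest_audio_index(audio_ts: Sequence[float], video_ts: Sequence[float]) -> List[int]:
    """For each video timestamp, return the index of the audio frame nearest in time.

    audio_ts must be non-empty and sorted in ascending order.
    """
    indices = []
    for t in video_ts:
        right = bisect_left(audio_ts, t)
        if right == 0:
            indices.append(0)
        elif right == len(audio_ts):
            indices.append(len(audio_ts) - 1)
        else:
            left = right - 1
            # Ties go to the earlier audio frame.
            closer_left = t - audio_ts[left] <= audio_ts[right] - t
            indices.append(left if closer_left else right)
    return indices


def window_count(n_samples: int, win: int, hop: int) -> int:
    """Number of sliding windows when the last partial window is zero-padded (assumption)."""
    if n_samples <= win:
        return 1
    return 1 + math.ceil((n_samples - win) / hop)


# Audio frames every 64 ms, video frames roughly every 33 ms (timestamps in ms).
audio_ms = [0, 64, 128]
video_ms = [0, 33, 66, 100, 133]
print(nearest_audio_index(audio_ms, video_ms))  # [0, 1, 1, 2, 2]

# At 16 kHz, 25 ms is 400 samples and 10 ms is 160 samples.
print(window_count(1024, 400, 160))  # 5
```

Indexing an array of audio frames with the returned indices yields one audio frame per video frame, with some audio frames repeated.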

The filenames of audio features are of the form `{session_ID}` and they are missing for session ID 09.<br>
The filenames of count, time, and visual features are of the form `{session_ID}_{participant_position}`.
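
The definitions of the count and time features can be made concrete with a small sketch. It is not the code that generated the provided .csv files, and the function name `count_and_time` is hypothetical. Two details are assumptions because this README does not specify them: before the first annotation the time feature is returned as `None`, and in the "zeroed within" variant the time is counted from the end of the annotation.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


def count_and_time(
    intervals: Sequence[Tuple[float, float]],
    t: float,
    zeroed_within: bool = False,
) -> Tuple[int, Optional[float]]:
    """Return (count, time since last annotation) at timestamp t.

    intervals holds (start, end) pairs of one annotation type, sorted by start
    and assumed not to overlap. The count includes every annotation that
    started on or before t.
    """
    starts = [start for start, _ in intervals]
    count = bisect_right(starts, t)  # number of annotations with start <= t
    if count == 0:
        return 0, None  # assumption: undefined before the first annotation
    start, end = intervals[count - 1]
    if not zeroed_within:
        # Default variant: time increases from the beginning of the annotation.
        return count, t - start
    # "Zeroed within" variant: zero during the annotation, then time since its end (assumption).
    return count, 0 if t <= end else t - end
```

For example, with annotations at 1000-1500 ms and 4000-4500 ms, `count_and_time(intervals, 3000)` gives a count of 1 and 2000 ms since the last annotation began.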


## Models
Model weights can be found at https://drive.google.com/drive/folders/1gJL4gJ1IxBcuQuns5Fy6Hmhe7IcrvPdF?usp=share_link<br>
Please see Section 4 and Appendix 8.2 of [our paper](https://arxiv.org/abs/2207.03348) [1] for details on model architectures and training.

---
## References

[1] Ondras, Jan, Abrar Anwar, Tong Wu, Fanjun Bu, Malte Jung, Jorge Jose Ortiz, and Tapomayukh Bhattacharjee. "Human-Robot Commensality: Bite Timing Prediction for Robot-Assisted Feeding in Groups." In 6th Annual Conference on Robot Learning. 2022. https://openreview.net/forum?id=7ZcePvChS7u

    @inproceedings{ondras2022humanrobot,
      title={Human-Robot Commensality: Bite Timing Prediction for Robot-Assisted Feeding in Groups},
      author={Jan Ondras and Abrar Anwar and Tong Wu and Fanjun Bu and Malte Jung and Jorge Jose Ortiz and Tapomayukh Bhattacharjee},
      booktitle={6th Annual Conference on Robot Learning},
      year={2022},
      url={https://openreview.net/forum?id=7ZcePvChS7u}
    }